xy
We import the csv file “loan_sample_9.csv” and make a copy of it to ensure that we don’t mess up the original dataset.
data_loans <- read_csv("loan_sample_9.csv")
## Rows: 40000 Columns: 17
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): grade, home_ownership, verification_status, purpose, application_type
## dbl (12): loan_amnt, int_rate, annual_inc, dti, open_acc, revol_bal, revol_u...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
data <- data_loans
In the first step we explore the data. We start by investigating the structure of the data set. There are 12 numeric and 5 categorical variables in the dataset. But the numeric variable “Status” with its values “1” and “0” looks like a factor and all the characteristic variables also look like factors.
## # A tibble: 6 x 17
## loan_amnt int_rate grade home_ownership annual_inc verification_status purpose
## <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 6000 18.2 D RENT 90000 Not Verified debt_c~
## 2 8000 13.3 C MORTGAGE 70000 Verified home_i~
## 3 6000 14.0 C MORTGAGE 54000 Source Verified debt_c~
## 4 1500 15.6 D RENT 53000 Not Verified credit~
## 5 7000 10.1 B RENT 65000 Not Verified debt_c~
## 6 5000 12.7 C RENT 37000 Not Verified debt_c~
## # i 10 more variables: dti <dbl>, open_acc <dbl>, revol_bal <dbl>,
## # revol_util <dbl>, total_acc <dbl>, total_rec_int <dbl>,
## # application_type <chr>, tot_cur_bal <dbl>, total_rev_hi_lim <dbl>,
## # Status <dbl>
## # A tibble: 6 x 17
## loan_amnt int_rate grade home_ownership annual_inc verification_status purpose
## <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 2000 8.18 B RENT 47000 Source Verified credit~
## 2 6000 14.5 C RENT 38000 Source Verified debt_c~
## 3 2500 9.93 B OWN 23000 Not Verified other
## 4 16000 19.0 D RENT 60000 Source Verified debt_c~
## 5 7000 9.17 B RENT 34000 Source Verified small_~
## 6 14400 17.0 D MORTGAGE 110000 Source Verified debt_c~
## # i 10 more variables: dti <dbl>, open_acc <dbl>, revol_bal <dbl>,
## # revol_util <dbl>, total_acc <dbl>, total_rec_int <dbl>,
## # application_type <chr>, tot_cur_bal <dbl>, total_rev_hi_lim <dbl>,
## # Status <dbl>
## spc_tbl_ [40,000 x 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ loan_amnt : num [1:40000] 6000 8000 6000 1500 7000 ...
## $ int_rate : num [1:40000] 18.2 13.3 14 15.6 10.1 ...
## $ grade : chr [1:40000] "D" "C" "C" "D" ...
## $ home_ownership : chr [1:40000] "RENT" "MORTGAGE" "MORTGAGE" "RENT" ...
## $ annual_inc : num [1:40000] 90000 70000 54000 53000 65000 37000 70000 36000 40000 15000 ...
## $ verification_status: chr [1:40000] "Not Verified" "Verified" "Source Verified" "Not Verified" ...
## $ purpose : chr [1:40000] "debt_consolidation" "home_improvement" "debt_consolidation" "credit_card" ...
## $ dti : num [1:40000] 25.67 6.72 13.16 16.85 2.36 ...
## $ open_acc : num [1:40000] 15 8 9 5 7 6 7 12 8 7 ...
## $ revol_bal : num [1:40000] 10839 690 8057 18382 4124 ...
## $ revol_util : num [1:40000] 28.7 3.4 42.6 85.1 19.3 36 74.1 22.7 60.1 57.4 ...
## $ total_acc : num [1:40000] 28 16 18 18 10 9 7 17 15 10 ...
## $ total_rec_int : num [1:40000] 1153 705 1088 338 142 ...
## $ application_type : chr [1:40000] "Individual" "Individual" "Individual" "Individual" ...
## $ tot_cur_bal : num [1:40000] 90776 199277 148632 23795 4124 ...
## $ total_rev_hi_lim : num [1:40000] 37745 20400 18900 21600 21400 ...
## $ Status : num [1:40000] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. loan_amnt = col_double(),
## .. int_rate = col_double(),
## .. grade = col_character(),
## .. home_ownership = col_character(),
## .. annual_inc = col_double(),
## .. verification_status = col_character(),
## .. purpose = col_character(),
## .. dti = col_double(),
## .. open_acc = col_double(),
## .. revol_bal = col_double(),
## .. revol_util = col_double(),
## .. total_acc = col_double(),
## .. total_rec_int = col_double(),
## .. application_type = col_character(),
## .. tot_cur_bal = col_double(),
## .. total_rev_hi_lim = col_double(),
## .. Status = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
We check the presence of NAs in each of the variables included in the dataset. There are no NAs values in this dataset.
knitr::kable(apply(data, 2, function(x) any(is.na(x))))
| x | |
|---|---|
| loan_amnt | FALSE |
| int_rate | FALSE |
| grade | FALSE |
| home_ownership | FALSE |
| annual_inc | FALSE |
| verification_status | FALSE |
| purpose | FALSE |
| dti | FALSE |
| open_acc | FALSE |
| revol_bal | FALSE |
| revol_util | FALSE |
| total_acc | FALSE |
| total_rec_int | FALSE |
| application_type | FALSE |
| tot_cur_bal | FALSE |
| total_rev_hi_lim | FALSE |
| Status | FALSE |
Now we have 12 numeric and 5 character variables.
overview <- overview(data)
plot(overview)
***
We transform the characteristic variables in factors to count the categories and order them.
data$grade = as.factor(data$grade)
data$home_ownership = as.factor(data$home_ownership)
data$verification_status = as.factor(data$verification_status)
data$purpose = as.factor(data$purpose)
data$application_type = as.factor(data$application_type)
data$Status = as.factor(data$Status)
data <- data %>%
select(order(sapply(., is.factor)),order(sapply(., is.numeric)))
overview <- overview(data)
plot(overview)
***
In most numerical variables there is a large gap between the minimum and maximum. For example, “loan-amnt” (amount of the loan applied for by the borrower) has a minimum of 1,000 and a maximum of 40,000, or “revol_bal” (Total credit revolving balance) from USD 0 to USD 78,762. The average interest rate “int_rate” is around 12.63%, with values between 5.31% and 27.49%. The annual income “annual_inc” of borrowers varies greatly, with an average of around USD 63,277. There are outliers with very high annual salaries. There are borrowers with a dti of 0, which could indicate low indebtedness.
The Variable “purpose” (category provided by the borrower for the loan request) has many categories. They contain the name of the type of loan, except for one group. This group is labeled as “other” and contains 2,283 values. Most loans are used for debt consolidation and credit cards.
The most people are graded between “B” and “C”, in the grades “A” or “B” are similar number of people. The variable “grade” assigned loan grade by the financial service provider.
The most people are in rent or has a mortgage for there home. 3,982 people are home owner. 14,278 people from 40,000 aren’t verified.
We see that 14,278 people are not verifide from 40,000 people. 16,129 are source verifide.
Only 530 joined via App from 40,000 people in the System.
The target variable “Status” is unbalanced, as there are more loans without default (status 0 = 34,794 persons) than with default (status 1 = 5,206).
summary(data)
## loan_amnt int_rate annual_inc dti
## Min. : 1000 Min. : 5.31 Min. : 6600 Min. : 0.00
## 1st Qu.: 7000 1st Qu.: 9.44 1st Qu.: 42000 1st Qu.:12.17
## Median :10050 Median :12.29 Median : 57000 Median :17.67
## Mean :11682 Mean :12.63 Mean : 63277 Mean :18.24
## 3rd Qu.:15125 3rd Qu.:15.05 3rd Qu.: 77000 3rd Qu.:23.89
## Max. :40000 Max. :27.49 Max. :400000 Max. :60.14
##
## open_acc revol_bal revol_util total_acc
## Min. : 1.00 Min. : 0 Min. : 0.00 Min. : 3.00
## 1st Qu.: 8.00 1st Qu.: 5619 1st Qu.: 34.80 1st Qu.:15.00
## Median :10.00 Median : 9760 Median : 52.50 Median :20.00
## Mean :10.29 Mean :11948 Mean : 52.24 Mean :21.27
## 3rd Qu.:13.00 3rd Qu.:15792 3rd Qu.: 70.00 3rd Qu.:27.00
## Max. :23.00 Max. :78762 Max. :123.20 Max. :57.00
##
## total_rec_int tot_cur_bal total_rev_hi_lim grade home_ownership
## Min. : 0.0 Min. : 0 Min. : 400 A: 7274 MORTGAGE:17736
## 1st Qu.: 680.2 1st Qu.: 25136 1st Qu.: 12998 B:13263 OWN : 3982
## Median :1345.5 Median : 53821 Median : 20700 C:11807 RENT :18282
## Mean :1820.6 Mean : 99208 Mean : 24089 D: 7656
## 3rd Qu.:2433.9 3rd Qu.:158638 3rd Qu.: 32000
## Max. :8834.9 Max. :472573 Max. :100000
##
## verification_status purpose application_type
## Not Verified :14278 debt_consolidation:23414 Individual:39470
## Source Verified:16129 credit_card : 9362 Joint App : 530
## Verified : 9593 other : 2283
## home_improvement : 2095
## major_purchase : 807
## medical : 445
## (Other) : 1594
## Status
## 0:34794
## 1: 5206
##
##
##
##
##
In the next step, we investigate our target variable “Status”. We notice also before in our sample, that we have 5,206 persons which did not default on their loan and we have 34,794 which did default.
As we can see in the visualization the data set is highly imbalanced.
ggplot(data, aes(x = Status, fill = Status)) +
geom_bar() +
ylab("Count") +
xlab("Status of the loan")
PercTable(data$Status)
##
## freq perc
##
## 0 34'794 87.0%
## 1 5'206 13.0%
In the next step, we carry-out under sampling and visualizate it again.
set.seed(7)
data_original <- data
data_balanced <- ovun.sample(Status ~ ., data=data, method = "under")
data_under <- data.frame(data_balanced[["data"]])
Visualization of the level of the target variable
We provide a boxplot of the numeric variables in both the original and under-sampled dataset.
knitr::kable(diagnose_outlier(data_under), caption = "Diagnose Outlier", digits = 2)
| variables | outliers_cnt | outliers_ratio | outliers_mean | with_mean | without_mean |
|---|---|---|---|---|---|
| loan_amnt | 237 | 2.28 | 32363.92 | 11919.02 | 11442.71 |
| int_rate | 175 | 1.68 | 25.85 | 13.56 | 13.35 |
| annual_inc | 316 | 3.04 | 163143.59 | 61085.99 | 57891.00 |
| dti | 25 | 0.24 | 47.76 | 18.98 | 18.91 |
| open_acc | 87 | 0.84 | 21.37 | 10.30 | 10.21 |
| revol_bal | 460 | 4.42 | 38905.07 | 11753.72 | 10498.49 |
| revol_util | 0 | 0.00 | NaN | 53.36 | 53.36 |
| total_acc | 47 | 0.45 | 49.19 | 20.94 | 20.82 |
| total_rec_int | 591 | 5.68 | 6653.93 | 1868.94 | 1580.94 |
| tot_cur_bal | 386 | 3.71 | 363182.06 | 91491.63 | 81029.48 |
| total_rev_hi_lim | 275 | 2.64 | 67840.00 | 23042.41 | 21826.89 |
We note that for the variables “annual_inc” (The self-reported annual income provided by the borrower during registration) the visualization changes considerably and there the median also tends to shift strongly.
We do winsorizing for dealing with the highest outliers.
outlier <- function(x){
quantiles <- quantile(x, c(.05, .95))
x[x < quantiles[1]] <- quantiles[1]
x[x > quantiles[2]] <- quantiles[2]
x
}
data_new_under <- map_df(data_under[,-c(12:17)], outlier)
cols <- data_under[,c(12:17)]
data_new_under <- cbind(data_new_under, cols)
boxplot(scale(data_new_under[,c(1:11)]), use.cols = TRUE)
***
ggpairs(data[, c("loan_amnt", "int_rate", "annual_inc", "dti", "total_acc", "total_rec_int", "tot_cur_bal")],
aes(color = as.factor(data$Status)))
plot_ly(data, x = ~loan_amnt, y = ~annual_inc, mode = "markers",
type = "scatter", marker = list(color = "#00CED1")) %>%
layout(title = "Scatter Plot of Loan Amount vs. Annual Income",
xaxis = list(title = "Loan Amount"),
yaxis = list(title = "Annual Income"))